Mihai Ion
FIN 525

L20: Backtesting - data prep

# Import packages
import pandas as pd
pd.options.display.max_rows = 20

Lecture Overview

We will spend the next four lectures backtesting the asset-growth anomaly. Here are the overall steps:

  • Data
    • Load “comp_clean”, apply filters, and create (some of) the variables in Table 1
    • Do the same for “crsp_clean”
    • Merge the two datasets above
  • Table 1
    • Create AG deciles
    • Summarize selected firm characteristics at the AG-decile level
  • Create portfolios
    • Merge AG deciles with monthly returns data
    • Create a variable that tells us which stock is in which portfolio at each point in time
  • Table 2
    • EW portfolio returns
    • VW portfolio returns
    • Risk-adjusted EW portfolio returns
    • Risk-adjusted VW portfolio returns

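The "Create AG deciles" step above can be sketched with pd.qcut, which ranks firms into equal-sized bins by AG within each year. A minimal sketch on hypothetical data (the frame below is illustrative, not the actual Compustat sample):

```python
import pandas as pd

# Toy data: ten firms in one year, with asset growth (AG) already computed
comp = pd.DataFrame({
    'permno': range(1, 11),
    'year':   [1990] * 10,
    'AG':     [-0.3, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
})
# Assign each firm-year to an AG decile (1 = lowest growth, 10 = highest),
# ranking within each year separately
comp['AG_decile'] = comp.groupby('year')['AG'].transform(
    lambda x: pd.qcut(x, 10, labels=False) + 1
)
```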
Data

Accounting data (Compustat)

Data cleaning

# Load cleaned Compustat data
raw = pd.read_pickle('../data/compa.zip')
raw.shape
(278226, 23)
# Keep only what we need
comp = raw[['permno','datadate','at','revt','cogs','dltt','dlc','ib']].copy()
comp.dtypes
permno      float64
datadate     object
at          float64
revt        float64
cogs        float64
dltt        float64
dlc         float64
ib          float64
dtype: object
# Clean firmid and dates
comp['dtdate'] = pd.to_datetime(comp['datadate'])
comp['year'] = comp['dtdate'].dt.year
comp['permno'] = comp['permno'].astype('int64')
comp.dtypes
permno               int64
datadate            object
at                 float64
revt               float64
cogs               float64
dltt               float64
dlc                float64
ib                 float64
dtdate      datetime64[ns]
year                 int64
dtype: object
# Keep positive total assets, sort, and drop duplicates
comp = comp[comp['at']>0].copy()
comp = comp.sort_values(['permno','year'])
comp = comp.drop_duplicates(['permno','year'], keep='last', ignore_index=True)
comp.head()
permno datadate at revt cogs dltt dlc ib dtdate year
0 10000 1986-10-31 2.115 1.026 0.511 0.058 0.968 -0.730 1986-10-31 1986
1 10001 1986-06-30 12.242 21.460 19.565 2.946 0.343 0.669 1986-06-30 1986
2 10001 1987-06-30 11.771 16.621 15.538 2.750 0.377 0.312 1987-06-30 1987
3 10001 1988-06-30 11.735 16.978 15.556 2.555 0.325 0.542 1988-06-30 1988
4 10001 1989-06-30 18.565 22.910 19.856 7.370 0.185 1.208 1989-06-30 1989
comp.shape
(236316, 10)
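The `keep='last'` argument retains the most recent fiscal period when a firm reports twice in the same calendar year (e.g., after a fiscal-year-end change). A minimal illustration on hypothetical data:

```python
import pandas as pd

# Toy example: one firm with two fiscal periods ending in the same calendar year
df = pd.DataFrame({
    'permno': [1, 1],
    'year':   [1990, 1990],
    'at':     [5.0, 6.0],
})
df = df.sort_values(['permno', 'year'])
# keep='last' retains the later of the two observations for (permno, year)
out = df.drop_duplicates(['permno', 'year'], keep='last', ignore_index=True)
```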

New variables

# Calculate some of the variables in Table 1
# Asset growth (AG) = percentage change in total assets from the prior fiscal year
comp['at_lag1'] = comp.groupby('permno')['at'].shift(1)
comp['AG'] = comp['at'] / comp['at_lag1'] - 1
comp['L2AG'] = comp.groupby('permno')['AG'].shift(1)  # AG lagged one more year

comp['Leverage'] = (comp['dltt'] + comp['dlc']) / comp['at_lag1']
comp['ROA'] = comp['ib'] / comp['at']
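The `groupby` before `shift` matters: a plain `shift(1)` would leak the last observation of one firm into the first row of the next firm. A minimal illustration on hypothetical data:

```python
import pandas as pd

# Toy panel: two firms, two years each
df = pd.DataFrame({
    'permno': [1, 1, 2, 2],
    'at':     [10.0, 12.0, 100.0, 90.0],
})
# Correct: lag within each firm, so each firm's first year gets NaN
df['at_lag1'] = df.groupby('permno')['at'].shift(1)
# (A plain df['at'].shift(1) would instead give firm 2's first row
#  a "lag" of 12.0 -- firm 1's last value.)
df['AG'] = df['at'] / df['at_lag1'] - 1
```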
# Keep only the variables we need, and observations with non-missing AG
comp = comp.loc[comp['AG'].notnull(), :].copy()
comp_cgs = comp[['permno','year','dtdate','AG','L2AG', 'at','Leverage', 'ROA']].copy()
comp_cgs.head()
permno year dtdate AG L2AG at Leverage ROA
2 10001 1987 1987-06-30 -0.038474 NaN 11.771 0.255432 0.026506
3 10001 1988 1988-06-30 -0.003058 -0.038474 11.735 0.244669 0.046187
4 10001 1989 1989-06-30 0.582020 -0.003058 18.565 0.643801 0.065069
5 10001 1990 1990-06-30 0.017021 0.582020 18.881 0.396984 0.059901
6 10001 1991 1991-06-30 0.038028 0.017021 19.599 0.380012 0.054748
comp_cgs.shape
(212504, 8)
# Save this for later use
comp_cgs.to_pickle('../data/comp_cgs.zip')

Market data (CRSP)

Data Cleaning

# Load CRSP data
crsp = pd.read_pickle('../data/crspm.zip')
crsp.shape
(2553287, 10)
crsp.dtypes
permno    float64
permco    float64
date       object
prc       float64
ret       float64
shrout    float64
shrcd     float64
exchcd    float64
siccd     float64
ticker     object
dtype: object
# Keep only what we need
crsp = crsp[['permno','date','ret','prc','shrout','siccd']].copy().dropna()
crsp.shape
(2530665, 6)
# Get rid of financials (first digit of siccd = 6)
crsp = crsp.loc[crsp['siccd'].astype('string').str[0] != '6', :].copy()
crsp.shape
(2048105, 6)
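Since siccd is stored as a float, the string comparison above works on codes like "3990.0" because only the first character is checked. An equivalent numeric filter uses integer division; a sketch on hypothetical data:

```python
import pandas as pd

# Toy data: SIC division 6 (finance) corresponds to codes 6000-6999
df = pd.DataFrame({'permno': [1, 2, 3],
                   'siccd': [3990.0, 6020.0, 7372.0]})
# Keep firms whose SIC code does not start with 6
nonfin = df.loc[df['siccd'] // 1000 != 6, :]
```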
# Clean up firmid and dates
crsp['dtdate'] = pd.to_datetime(crsp['date'])
crsp['mdate'] = crsp['dtdate'].dt.to_period('M')
crsp['permno'] = crsp['permno'].astype('int64')

# Drop duplicates and sort
crsp = crsp.sort_values(['permno','mdate'])
crsp = crsp.drop_duplicates(['permno','mdate'], keep='last',ignore_index=True)
crsp.shape
(2048105, 8)
# Calculate market capitalization in $ millions (shrout is in thousands of shares).
# CRSP flags prices imputed from bid/ask averages with a negative sign, so take the absolute value.
crsp['MV'] = crsp['prc'].abs() * crsp['shrout'] / 1000
# Calculate lagged market cap
crsp.sort_values(['permno','mdate'], inplace=True)
crsp['mktcap_lag1'] = crsp.groupby('permno')['MV'].shift(1)
# Keep only the variables we need
crsp_cgs = crsp[['permno','mdate','dtdate','ret','MV','mktcap_lag1']].copy()
crsp_cgs.head(2)
permno mdate dtdate ret MV mktcap_lag1
0 10000 1986-02 1986-02-28 -0.257143 11.96 NaN
1 10000 1986-03 1986-03-31 0.365385 16.33 11.96
# Save for later use
crsp_cgs.to_pickle('../data/crsp_cgs.zip')

Calculate CRSP variables for Table 1

# Calculate cumulative (buy-and-hold) returns over the past 6 months (BHRET6)
crsp['BHRET6'] = 1
for i in range(0, 6):
    # multiply in the gross return from i months ago (i = 0 is the current month)
    crsp['BHRET6'] = crsp['BHRET6'] * (1 + crsp.groupby('permno')['ret'].shift(i))

crsp['BHRET6'] = crsp['BHRET6'] - 1
crsp['BHRET6'].describe()
count    1.957695e+06
mean     6.840459e-02
std      5.437771e-01
min     -9.999818e-01
25%     -1.941748e-01
50%      1.254532e-02
75%      2.265447e-01
max      6.691429e+01
Name: BHRET6, dtype: float64
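An alternative to the shift loop is to compound via sums of log gross returns, which pandas can roll over directly within each firm. A sketch on hypothetical data (the toy frame and the `bhret3` column name are illustrative; the data is already sorted by firm and month):

```python
import numpy as np
import pandas as pd

# Toy panel: two firms, four months of returns each
df = pd.DataFrame({
    'permno': [1, 1, 1, 1, 2, 2, 2, 2],
    'ret':    [0.10, -0.05, 0.02, 0.03, 0.00, 0.10, 0.10, -0.20],
})
# Buy-and-hold return over 3 months: prod(1+r) - 1 = expm1(rolling sum of log1p(r))
logret = np.log1p(df['ret'])
df['bhret3'] = np.expm1(
    logret.groupby(df['permno']).rolling(3).sum().reset_index(level=0, drop=True)
)
```

The first two months of each firm are NaN because a 3-month window is incomplete, matching the NaN pattern produced by the shift loop.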

Challenge:

Write a function called compound that calculates cumulative returns over a sequence of months and adds them as another column in the crsp dataframe (i.e., exactly what we did above with BHRET6, but for an arbitrary range of lags, not just 6).

def compound(dset=None, outvar=None,
             firmid='permno', datevar='mdate', retvar='ret',
             startlag=None, endlag=None):
    """Add a column `outvar` to `dset` with cumulative (buy-and-hold) returns
    compounded from lag `startlag` to lag `endlag` (inclusive).
    Negative lags compound over future months. Modifies `dset` in place."""

    dset.sort_values([firmid, datevar], inplace=True)  # side effect: reorders dset

    dset[outvar] = 1
    for i in range(startlag, endlag + 1):
        dset[outvar] = dset[outvar] * (1 + dset.groupby(firmid)[retvar].shift(i))

    dset[outvar] = dset[outvar] - 1
    return
compound(crsp,'mybhret6',startlag=0, endlag=5)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN
compound(crsp,'BHRET36',startlag=0, endlag=35)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6 BHRET36
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN NaN
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN NaN

Challenge:

Create a new variable called FBHRET12 which equals the (net) cumulative returns in the 12 months FOLLOWING the current month (t+1 to t+12 inclusive).

compound(crsp,'FBHRET12',startlag = -12, endlag = -1)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6 BHRET36 FBHRET12
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN NaN -0.875000
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN NaN -0.943662
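The negative lags work because shift with a negative period pulls future observations backward onto the current row. A toy check on hypothetical data:

```python
import pandas as pd

# Toy data: one firm, three months of returns
df = pd.DataFrame({'permno': [1, 1, 1], 'ret': [0.1, 0.2, 0.3]})
# shift(-1) aligns next month's return with the current row;
# the last month of each firm has no future observation, so it is NaN
df['fret1'] = df.groupby('permno')['ret'].shift(-1)
```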
# Keep only the variables we need for Table 1
crsp_table1 = crsp[['permno','mdate','MV','BHRET6','BHRET36','FBHRET12']].copy()
crsp_table1.head(7)
permno mdate MV BHRET6 BHRET36 FBHRET12
0 10000 1986-02 11.960000 NaN NaN -0.875000
1 10000 1986-03 16.330000 NaN NaN -0.943662
2 10000 1986-04 15.172000 NaN NaN -0.941406
3 10000 1986-05 11.793859 NaN NaN -0.929648
4 10000 1986-06 11.734594 NaN NaN NaN
5 10000 1986-07 10.786344 -0.350000 NaN NaN
6 10000 1986-08 4.148594 -0.663462 NaN NaN
# Save for later use
crsp_table1.to_pickle('../data/crsp_cgs_table1.zip')